Abstract

This paper explores Maximum Likelihood in parametric models in the context
of Sanov type Large Deviation Probabilities. MLE in parametric models under
weighted sampling is shown to be associated with the minimization of a specific
divergence criterion defined with respect to the distribution of the weights. Some
properties of the resulting inferential procedure are presented; Bahadur efficiency
of tests is also considered in this context.

1 Motivation and context 

This paper explores the Maximum Likelihood paradigm in the context of sampling. Its main
point is that the inference criterion is strongly connected with the sampling scheme generating
the data. Under a given model, when i.i.d. sampling is considered and some standard
regularity is assumed, the Maximum Likelihood principle loosely states that, conditionally
upon the observed data, resampling under the same i.i.d. scheme should closely resemble
the initial sample only when the resampling distribution is close to the initial
unknown one.

Keeping the same definition, it appears that under other sampling schemes the Maximum
Likelihood Principle yields a wide range of statistical procedures. These have in
common with the classical simple i.i.d. sampling case that they can be embedded in a
natural class of methods based on minimization of $\phi$-divergences between the empirical
measure of the data and the model. In the classical i.i.d. case the divergence is the
Kullback-Leibler one, which yields the standard form of the Likelihood function. In the
case of the weighted bootstrap, the divergence to be optimized is directly related to the
distribution of the weights.

This paper discusses the choice of an inference criterion in a parametric setting. We
consider a wide range of commonly used statistical criteria, namely all those induced
by the so-called power divergences, therefore including Maximum Likelihood,
Kullback-Leibler, Chi-square, Hellinger distance, etc. The steps of the discussion are as follows.

We first place the Maximum Likelihood paradigm at the center of the scene, putting
forward its strong connection with large deviation probabilities for the empirical measure.
The argument can be sketched as follows: for any putative $\theta$ in the parameter set, consider
$n$ virtual simulated r.v's $X_{i,\theta}$ with corresponding empirical measure $P_{n,\theta}$. Evaluate the
probability that $P_{n,\theta}$ is close to $P_n$, conditionally on $P_n$, the empirical measure pertaining
to the observed data; such a statement is referred to as a conditional Sanov theorem, and
for any $\theta$ this probability is governed by the Kullback-Leibler distance between $P_\theta$ and
$P_{\theta_T}$, where $\theta_T$ stands for the true value of the parameter. Estimate this probability for
any $\theta$, based on the observed data. Optimize in $\theta$; this provides the MLE, as
shown in the two cases of the i.i.d. sampling scheme: the first case is when the
observations take values in a finite set, and the second (infinite) case helps to set
the arguments to be put forward. Introducing MLE's through large deviations for the
empirical measure is in the vein of various recent approaches; see Grendar and Judge [7].

We next consider a generalized sampling scheme inherited from the bootstrap, which
we call weighted sampling; it amounts to introducing a family of i.i.d. weights $W_1,\ldots,W_n$
with mean and variance 1. The corresponding empirical measure pertaining to the data
set $x_1,\ldots,x_n$ is just the weighted empirical measure. The MLE is defined through a
procedure similar to the one just described. The conditional Sanov theorem is governed by a
divergence criterion which is defined through the distribution of the weights. Hence MLE results
in the optimization of a divergence measure between distributions in the model and the
weighted empirical measure pertaining to the dataset.

Resulting properties of the estimators are studied. 

Optimization of $\phi$-divergences between the empirical measure of the data and the
model is problematic when the support of the model is not finite. A number of authors
have considered so-called dual representation formulas for divergences or, more generally, for
convex pseudodistances between distributions. We will make use of the one presented in
[3]; see also [1] for an easy derivation.

1.1 Notation 
1.1.1 Divergences 

The space $S$ is a Polish space endowed with its Borel field $\mathcal{B}(S)$. We consider an
identifiable parametric model $\mathcal{P}_\Theta$ on $(S,\mathcal{B}(S))$, hence a class of probability distributions $P_\theta$
indexed by a subset $\Theta$ of $\mathbb{R}^d$; $\Theta$ need not be open. The class of all probability
measures on $(S,\mathcal{B}(S))$ is denoted $\mathcal{P}$, and $\mathcal{M}(S)$ designates the class of all finite signed
measures on $(S,\mathcal{B}(S))$.

A non negative convex function $\varphi$ with values in $\mathbb{R}^+$, belonging to $C^2(\mathbb{R})$ and satisfying
$\varphi(1) = \varphi'(1) = 0$ and $\varphi''(1) = 1$ is a divergence function. An important class of such functions
is defined through the power divergence functions

$$\varphi_\gamma(x) := \frac{x^\gamma - \gamma x + \gamma - 1}{\gamma(\gamma-1)} \qquad (1.1)$$

defined for all real $\gamma \neq 0,1$, with $\varphi_0(x) := -\log x + x - 1$ (the likelihood divergence
function) and $\varphi_1(x) := x\log x - x + 1$ (the Kullback-Leibler divergence function). This
class is usually referred to as the Cressie-Read family of divergence functions, a custom we
will follow, although it originates in [12]. When $x$ is such that $\varphi_\gamma(x)$ is undefined
by the above definitions, we set $\varphi_\gamma(x) := +\infty$, by which the definition above is satisfied
for all $\varphi_\gamma$. It is the simplest power-type class of functions (with the limits as
$\gamma \to 0, 1$) which fulfill the definition. The $L^1$ divergence function $\varphi(x) := |x-1|$ is not
captured by the Cressie-Read family of functions.
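As an illustration of (1.1), the following minimal Python sketch evaluates the power divergence functions together with their limiting cases. The function name `cressie_read` and the handling of negative arguments (finite only for $\gamma = 2$) are assumptions of this sketch, not part of the paper.

```python
import math

def cressie_read(x: float, gamma: float) -> float:
    """Power divergence function phi_gamma(x) of (1.1), with the limits gamma = 0, 1.

    Returns +inf where the expression is undefined, following the convention
    stated in the text.
    """
    if x < 0:
        # Assumption of this sketch: negative arguments are kept only for gamma = 2,
        # where x**2 is well defined on the whole real line.
        return 0.5 * (x - 1.0) ** 2 if gamma == 2 else math.inf
    if gamma == 0:
        # Likelihood divergence function: -log x + x - 1
        return math.inf if x == 0 else -math.log(x) + x - 1.0
    if gamma == 1:
        # Kullback-Leibler divergence function, with the convention 0 log 0 = 0
        return 1.0 if x == 0 else x * math.log(x) - x + 1.0
    if x == 0 and gamma < 0:
        return math.inf
    return (x ** gamma - gamma * x + gamma - 1.0) / (gamma * (gamma - 1.0))

# phi_gamma(1) = 0 for every gamma; gamma = 1/2 gives 2 * (sqrt(x) - 1)**2.
values_at_one = [cressie_read(1.0, g) for g in (-1, 0, 0.5, 1, 2)]
hellinger_at_4 = cressie_read(4.0, 0.5)
```

Evaluating `cressie_read(2.0, 1e-6)` and `cressie_read(2.0, 0)` gives nearly equal values, which is the numerical counterpart of the limit $\gamma \to 0$.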

Associated with a divergence function $\varphi$ is the divergence pseudodistance $\phi$ between a
probability measure and a finite signed measure; see [1].

For $P$ and $Q$ in $\mathcal{M}$ define

$$\phi(Q,P) := \int \varphi\left(\frac{dQ}{dP}\right) dP \quad \text{whenever } Q \text{ is a.c. w.r.t. } P,$$
$$\phi(Q,P) := +\infty \quad \text{otherwise.}$$

The divergence $\phi(Q,P)$ is best seen as a mapping $Q \to \phi(Q,P)$ from $\mathcal{M}$ onto $\overline{\mathbb{R}}^+$ for fixed
$P$ in $\mathcal{M}$. Indexing this pseudodistance by $\gamma$ and using $\varphi_\gamma$ as divergence function yields
the likelihood divergence $\phi_0(Q,P) := -\int \log\left(\frac{dQ}{dP}\right) dP$, the Kullback-Leibler divergence
$\phi_1(Q,P) := \int \log\left(\frac{dQ}{dP}\right) dQ$, the Hellinger divergence $\phi_{1/2}(Q,P) := 2\int \left(\sqrt{\frac{dQ}{dP}} - 1\right)^2 dP$,
the modified $\chi^2$ divergence $\phi_{-1}(Q,P) := \frac{1}{2}\int \left(\frac{dQ}{dP} - 1\right)^2 \left(\frac{dQ}{dP}\right)^{-1} dP$. All these divergences
are defined on $\mathcal{P}$. The $\chi^2$ divergence $\phi_2(Q,P) := \frac{1}{2}\int \left(\frac{dQ}{dP} - 1\right)^2 dP$ is defined on $\mathcal{M}$. We
refer to [3] for the advantage of extending the definition to possibly signed measures in the
context of parametric inference for non regular models.

The conjugate divergence function of $\varphi$ is defined through

$$\tilde\varphi(x) := x\,\varphi\left(\frac{1}{x}\right) \qquad (1.2)$$

and the corresponding divergence pseudodistance $\tilde\phi(P,Q)$ is

$$\tilde\phi(P,Q) := \int \tilde\varphi\left(\frac{dP}{dQ}\right) dQ$$

which satisfies

$$\tilde\phi(P,Q) = \phi(Q,P)$$

whenever defined, and equals $+\infty$ otherwise. When $\varphi = \varphi_\gamma$ then $\tilde\varphi = \varphi_{1-\gamma}$, as follows
by substitution. Pairs $(\varphi_\gamma,\varphi_{1-\gamma})$ are therefore conjugate pairs. Inside the Cressie-Read
family, the Hellinger divergence function is self-conjugate.
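The substitution behind the conjugacy of the Cressie-Read pairs can be spelled out as follows; this is a short check for $\gamma \neq 0,1$ and $x > 0$, using only (1.1) and (1.2).

```latex
\begin{align*}
\tilde\varphi_\gamma(x) = x\,\varphi_\gamma\!\left(\tfrac{1}{x}\right)
  &= x\,\frac{x^{-\gamma} - \gamma x^{-1} + \gamma - 1}{\gamma(\gamma-1)}
   = \frac{x^{1-\gamma} + (\gamma-1)x - \gamma}{\gamma(\gamma-1)},\\
\varphi_{1-\gamma}(x)
  &= \frac{x^{1-\gamma} - (1-\gamma)x + (1-\gamma) - 1}{(1-\gamma)(-\gamma)}
   = \frac{x^{1-\gamma} + (\gamma-1)x - \gamma}{\gamma(\gamma-1)}.
\end{align*}
```

Both right-hand sides coincide, and taking $\gamma = 1/2$ gives $1-\gamma = \gamma$, which is the self-conjugacy of the Hellinger divergence function.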

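In parametric models $\varphi$-divergences between two distributions take a simple variational
form. It holds, when $\varphi$ is a differentiable function, and under a commonly met
regularity condition, denoted (RC) in [1],

$$\phi(P_\theta,P_{\theta_T}) = \sup_{\alpha\in\mathcal{U}} \int \varphi'\left(\frac{dP_\theta}{dP_\alpha}\right) dP_\theta - \int \varphi^{\#}\left(\frac{dP_\theta}{dP_\alpha}\right) dP_{\theta_T} \qquad (1.3)$$

where $\varphi^{\#}(x) := x\varphi'(x) - \varphi(x)$. In the above formula, $\mathcal{U}$ designates a subset of $\Theta$ containing
$\theta_T$ such that for any $\theta, \theta'$ in $\mathcal{U}$, $\phi(P_\theta,P_{\theta'})$ is finite. This formula holds for any divergence
in the Cressie-Read family, as considered here.
Denote

$$h(\theta,\alpha,x) := \int \varphi'\left(\frac{dP_\theta}{dP_\alpha}\right) dP_\theta - \varphi^{\#}\left(\frac{dP_\theta}{dP_\alpha}(x)\right)$$

from which

$$\phi(P_\theta,P_{\theta_T}) := \sup_{\alpha\in\mathcal{U}} \int h(\theta,\alpha,x)\,dP_{\theta_T}(x). \qquad (1.4)$$

For CR divergences

$$h(\theta,\alpha,x) = \frac{1}{\gamma-1}\int \left(\frac{dP_\theta}{dP_\alpha}\right)^{\gamma-1} dP_\theta - \frac{1}{\gamma}\left(\frac{dP_\theta}{dP_\alpha}(x)\right)^{\gamma} - \frac{1}{\gamma(\gamma-1)}.$$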

1.1.2 Weights 

For a given real valued random variable $W$ denote

$$M(t) := \log E \exp(tW) \qquad (1.5)$$

its cumulant generating function, which we assume to be finite in a non void interval
including 0 (this is the so-called Cramér condition). The Fenchel-Legendre transform of
$M$ is also called the Chernoff function and is defined through

$$\varphi^W(x) = M^*(x) := \sup_t \; tx - M(t). \qquad (1.6)$$

The function $x \to \varphi^W(x)$ is non negative, $C^2$ and convex. We also assume that $EW = 1$
together with $\mathrm{Var}\,W = 1$, which implies $\varphi^W(1) = (\varphi^W)'(1) = 0$ and $(\varphi^W)''(1) = 1$.
Hence $\varphi^W(x)$ is a divergence function with corresponding divergence pseudodistance $\phi^W$.
Associated with $\varphi^W$ is the conjugate divergence $\tilde\phi^W$ with divergence function $\tilde\varphi^W$, which
therefore satisfies

$$\tilde\phi^W(Q,P) = \phi^W(P,Q).$$
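To make the link between weights and divergence functions concrete, here is a minimal Python sketch computing the Chernoff function (1.6) numerically for three standard weight distributions with mean and variance 1. The helper `chernoff`, the search intervals and the three examples are choices made for this illustration; the closed forms quoted in the comments follow from elementary calculus.

```python
import math

def chernoff(cgf, x, lo, hi, iters=200):
    """Numerically evaluate M*(x) = sup_t { t*x - M(t) } over [lo, hi].

    t -> t*x - M(t) is concave, so a ternary search locates its maximum.
    """
    f = lambda t: t * x - cgf(t)
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    return f(0.5 * (lo + hi))

# Cumulant generating functions M(t) = log E exp(tW); each W has EW = VarW = 1.
cgf_exponential = lambda t: -math.log(1.0 - t)   # W ~ Exponential(1), finite for t < 1
cgf_poisson = lambda t: math.exp(t) - 1.0        # W ~ Poisson(1)
cgf_normal = lambda t: t + 0.5 * t * t           # W ~ Normal(1, 1)

x = 2.0
phi_exponential = chernoff(cgf_exponential, x, -50.0, 1.0 - 1e-9)  # x - 1 - log x
phi_poisson = chernoff(cgf_poisson, x, -30.0, 30.0)                # x log x - x + 1
phi_normal = chernoff(cgf_normal, x, -30.0, 30.0)                  # (x - 1)**2 / 2
```

In these three cases the Chernoff function coincides with the Cressie-Read functions $\varphi_0$, $\varphi_1$ and $\varphi_2$ respectively, so that the weight distribution selects the divergence.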
1.1.3 Measure spaces



This paper makes extensive use of Sanov type large deviation results for empirical measures
or weighted empirical measures. This requires some definitions and facts.

The vector space $\mathcal{M}(S)$ is endowed with the $\tau$-topology, which is the coarsest making
all mappings $Q \to \int f\,dQ$ continuous for any $Q \in \mathcal{M}(S)$ and any $f \in B(S)$, which denotes
the class of all bounded measurable functions on $(S,\mathcal{B}(S))$. A slightly stronger topology
will be used in this paper, the $\tau_0$-topology, introduced in [5], which is the natural setting
for our purpose. This topology can be described through the following basis of neighborhoods.
Consider $\mathfrak{P}$ the class of all partitions of $S$ and for $k \geq 1$ the class $\mathfrak{P}_k$ of all partitions of $S$
into $k$ disjoint sets, $\mathcal{P}_k := (A_1,\ldots,A_k)$ where the $A_i$'s belong to $\mathcal{B}(S)$. For fixed $P$ in $\mathcal{M}$,
for any $k$, any such partition $\mathcal{P}_k$ in $\mathfrak{P}_k$ and any positive $\varepsilon$ define the open neighborhood
$U(P,\varepsilon,\mathcal{P}_k)$ through

$$U(P,\varepsilon,\mathcal{P}_k) := \left\{ Q \in \mathcal{M} \text{ such that } \max_{1\le i\le k} |P(A_i) - Q(A_i)| < \varepsilon \text{ and } Q(A_i) = 0 \text{ if } P(A_i) = 0 \right\}.$$

The additional requirement $Q(A_i) = 0$ if $P(A_i) = 0$ in the above definition, with respect
to the classical definition of the basis of the $\tau$-topology, is essential for the derivation of
Sanov type theorems. Endowed with the $\tau_0$-topology, $\mathcal{M}$ is a Hausdorff locally convex
vector space.
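The neighborhoods above only involve the masses of the $k$ cells of the partition, so membership can be checked on finite vectors. The following Python sketch does this; the function name `in_tau0_neighborhood` and the representation of measures by their cell masses are assumptions of this illustration.

```python
def in_tau0_neighborhood(p_cells, q_cells, eps):
    """Check whether Q belongs to U(P, eps, P_k).

    p_cells[i] and q_cells[i] are the masses P(A_i) and Q(A_i) of the cells
    A_1, ..., A_k of the partition P_k.
    """
    if len(p_cells) != len(q_cells):
        raise ValueError("both measures must be given on the same partition")
    # Classical tau-topology requirement: cell masses are uniformly eps-close.
    close = max(abs(p - q) for p, q in zip(p_cells, q_cells)) < eps
    # Additional tau_0 requirement: Q puts no mass on the P-null cells.
    null_cells_respected = all(q == 0 for p, q in zip(p_cells, q_cells) if p == 0)
    return close and null_cells_respected

inside = in_tau0_neighborhood([0.5, 0.5, 0.0], [0.45, 0.55, 0.0], eps=0.1)
outside = in_tau0_neighborhood([0.5, 0.5, 0.0], [0.45, 0.50, 0.05], eps=0.1)
```

The second measure is uniformly close to the first one on all cells but charges a null cell, which is exactly what the additional requirement excludes.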

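The following Pinsker type property holds

$$\sup_{k}\,\sup_{\mathcal{P}_k\in\mathfrak{P}_k} \sum_{A_i\in\mathcal{P}_k} \varphi\left(\frac{Q(A_i)}{P(A_i)}\right) P(A_i) = \phi(Q,P);$$

see [8].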

For any $P$ in $\mathcal{M}$ the mapping $Q \to \phi(Q,P)$ is lower semi continuous; see [2], Proposition 2.2.
Denoting $(a,b)$ the domain of $\varphi$, whenever

$$\lim_{x\to a,\; x>a} \left|\frac{\varphi(x)}{x}\right| = \lim_{x\to b,\; x<b} \left|\frac{\varphi(x)}{x}\right| = +\infty$$

then for any positive $C$, the level set $\{Q : \phi(Q,P) \le C\}$ is $\tau_0$-compact, making $Q \to
\phi(Q,P)$ a so-called good rate function. Divergence functions $\varphi$ satisfying this requirement
are for example $\varphi_\gamma$ with $\gamma > 1$; see [2] for different cases.



1.1.4 Minimum dual divergence estimators 

The above formula (1.3) defines a whole range of plug in estimators of $\phi(P_\theta,P_{\theta_T})$ and of
$\theta_T$. Let $X_1,\ldots,X_n$ denote $n$ i.i.d. r.v's with common distribution $P_{\theta_T}$. Denoting

$$P_n := \frac{1}{n}\sum_{i=1}^n \delta_{X_i}$$

the empirical measure pertaining to this sample, the plug in estimator of $\phi(P_\theta,P_{\theta_T})$ is
defined through

$$\phi_n(P_\theta,P_{\theta_T}) := \sup_{\alpha\in\mathcal{U}} \int h(\theta,\alpha,x)\,dP_n(x)$$

and the family of M-estimators indexed by $\theta$

$$\alpha_n(\theta) := \arg\sup_{\alpha\in\mathcal{U}} \int h(\theta,\alpha,x)\,dP_n(x)$$

approximates $\theta_T$. In the above formulas $\mathcal{U}$ is defined after (1.3). See [3] and [13] for
asymptotic properties and robustness results.

Since $\phi(P_{\theta_T},P_{\theta_T}) = 0$, a natural estimator of $\theta_T$ which only depends on the choice of
the divergence function $\varphi$ is defined through

$$\theta_n := \arg\inf_{\theta\in\mathcal{U}} \phi_n(P_\theta,P_{\theta_T}) = \arg\inf_{\theta\in\mathcal{U}}\,\sup_{\alpha\in\mathcal{U}} \int h(\theta,\alpha,x)\,dP_n(x);$$

see [3] for limit properties.
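A worked special case may help to read these definitions. Take the likelihood divergence function $\varphi_0(x) = -\log x + x - 1$ and assume, for this illustration only, that the $P_\theta$'s have densities $p_\theta$ with respect to a common dominating measure and share the same support, so that $\int (dP_\alpha/dP_\theta)\,dP_\theta = 1$.

```latex
\begin{align*}
\varphi_0'(x) &= 1 - \frac{1}{x}, \qquad
\varphi_0^{\#}(x) = x\varphi_0'(x) - \varphi_0(x) = \log x,\\
h(\theta,\alpha,x) &= \int \Big(1 - \frac{dP_\alpha}{dP_\theta}\Big)\,dP_\theta
   - \log\frac{p_\theta(x)}{p_\alpha(x)}
   = \log\frac{p_\alpha(x)}{p_\theta(x)},\\
\theta_n &= \arg\inf_{\theta}\,\sup_{\alpha}\;
   \frac{1}{n}\sum_{i=1}^n \log\frac{p_\alpha(X_i)}{p_\theta(X_i)}
   = \arg\sup_{\theta}\; \frac{1}{n}\sum_{i=1}^n \log p_\theta(X_i).
\end{align*}
```

The supremum over $\alpha$ does not depend on $\theta$, so under these assumptions the minimum dual divergence estimator built on $\varphi_0$ reduces to the usual maximum likelihood estimator.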



2 Large deviation and maximum likelihood 

2.1 Maximum likelihood under finite supported distributions 
and simple sampling 

Suppose that all probability measures $P_\theta$ in $\mathcal{P}_\Theta$ share the same finite support $S :=
\{1,\ldots,k\}$. Let $X_1,\ldots,X_n$ be a set of $n$ independent random variables with common
probability measure $P_{\theta_T}$ and consider the Maximum Likelihood estimator of $\theta_T$. A common
way to define the ML paradigm is as follows: for any $\theta$ consider independent random
variables $(X_{1,\theta},\ldots,X_{n,\theta})$ with probability measure $P_\theta$, thus sampled in the same way as the
$X_i$'s, but under some alternative $\theta$. Define $\theta_{ML}$ as the value of the parameter $\theta$ for which
the probability that, up to a permutation of the order of the $X_{i,\theta}$'s,
$(X_{1,\theta},\ldots,X_{n,\theta})$ occupies $S$ as does $X_1,\ldots,X_n$ is maximal, conditionally on the observed
sample $X_1,\ldots,X_n$. In formula, let $\sigma$ denote a random permutation of the indexes $\{1,2,\ldots,n\}$;
then $\theta_{ML}$ is defined through

$$\theta_{ML} := \arg\max_\theta \frac{1}{n!}\sum_{\sigma} P_\theta\left( (X_{\sigma(1),\theta},\ldots,X_{\sigma(n),\theta}) = (X_1,\ldots,X_n) \mid (X_1,\ldots,X_n) \right) \qquad (2.1)$$

where the summation is extended over all equally probable permutations of $\{1,2,\ldots,n\}$.
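Denote

$$P_n := \frac{1}{n}\sum_{i=1}^n \delta_{X_i}$$

and

$$P_{n,\theta} := \frac{1}{n}\sum_{i=1}^n \delta_{X_{i,\theta}}$$

the empirical measures pertaining respectively to $(X_1,\ldots,X_n)$ and $(X_{1,\theta},\ldots,X_{n,\theta})$.
An alternative expression for $\theta_{ML}$ is

$$\theta_{ML} := \arg\max_\theta P_\theta\left(P_{n,\theta} = P_n \mid P_n\right). \qquad (2.2)$$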



An explicit enumeration of the above expression $P_\theta(P_{n,\theta} = P_n \mid P_n)$ involves the quantities

$$n_j := \mathrm{card}\{i : X_i = j\}$$

for $j = 1,\ldots,k$ and yields

$$P_\theta\left(P_{n,\theta} = P_n \mid P_n\right) = \frac{n!\,\prod_{j=1}^k P_\theta(j)^{n_j}}{\prod_{j=1}^k n_j!} \qquad (2.3)$$

as follows from the classical multinomial distribution. Optimizing on $\theta$ in (2.3) yields

$$\theta_{ML} = \arg\max_\theta \sum_{j=1}^k \frac{n_j}{n}\log P_\theta(j) = \arg\max_\theta \frac{1}{n}\sum_{i=1}^n \log P_\theta(X_i).$$

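Consider now the Kullback-Leibler distance between $P_\theta$ and $P_n$, which is not symmetric
and is defined through

$$KL(P_n,P_\theta) := \sum_{j=1}^k \varphi\left(\frac{n_j/n}{P_\theta(j)}\right) P_\theta(j) = \sum_{j=1}^k (n_j/n)\log\frac{n_j/n}{P_\theta(j)} \qquad (2.4)$$

where

$$\varphi(x) := x\log x - x + 1 \qquad (2.5)$$

which is the Kullback-Leibler divergence function. Minimizing the Kullback-Leibler distance
$KL(P_n,P_\theta)$ upon $\theta$ yields

$$\theta_{KL} = \arg\min_\theta KL(P_n,P_\theta) = \arg\min_\theta \left(-\sum_{j=1}^k \frac{n_j}{n}\log P_\theta(j)\right) = \arg\max_\theta \sum_{j=1}^k \frac{n_j}{n}\log P_\theta(j) = \theta_{ML}.$$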

Introduce the conjugate divergence function $\tilde\varphi$ of $\varphi$, inducing the modified Kullback-Leibler,
or so-called Likelihood divergence pseudodistance $KL_m$, which therefore satisfies

$$KL_m(P_\theta,P_n) = KL(P_n,P_\theta).$$

We have proved that minimizing the Kullback-Leibler divergence $KL(P_n,P_\theta)$ amounts to
minimizing the Likelihood divergence $KL_m(P_\theta,P_n)$ and produces the ML estimate of $\theta_T$.
The Kullback-Leibler divergence as defined above by $KL(P_n,P_\theta)$ is related to the way $P_n$
keeps away from $P_\theta$ when $\theta$ is not equal to the true value of the parameter $\theta_T$ generating
the observations $X_i$, and is closely related to the type of sampling of the $X_i$'s. In
the present case i.i.d. sampling of the $X_{i,\theta}$'s under $P_\theta$ results in the asymptotic property,
named Large Deviation Sanov property,

$$\lim_{n\to\infty} \frac{1}{n}\log P_\theta\left(P_{n,\theta} = P_n \mid P_n\right) = -KL(P_{\theta_T},P_\theta). \qquad (2.6)$$

This result can easily be obtained from (2.3) using the Stirling formula to handle the factorial
terms and the law of large numbers, which states that for all $j$'s, $n_j/n$ tends to $P_{\theta_T}(j)$ as
$n$ tends to infinity. Comparing with (2.4) we note that the ML estimator $\theta_{ML}$ estimates
the minimizer of the natural estimator of $KL(P_{\theta_T},P_\theta)$ in $\theta$, substituting the unknown
measure generating the $X_i$'s by its empirical counterpart $P_n$. Alternatively, as will be used
in the sequel, $\theta_{ML}$ minimizes upon $\theta$ the Likelihood divergence $KL_m(P_\theta,P_{\theta_T})$ between
$P_\theta$ and $P_{\theta_T}$, substituting the unknown measure $P_{\theta_T}$ generating the $X_i$'s by its empirical
counterpart $P_n$. Summarizing we have obtained:

The ML estimate can be obtained from a LDP statement as given in (2.6), optimizing
in $\theta$ the estimator of the LDP rate where the plug-in method of the empirical measure
of the data is used instead of the unknown measure $P_{\theta_T}$. Alternatively it holds

$$\theta_{ML} := \arg\min_\theta \widehat{KL_m}(P_\theta,P_{\theta_T}) \qquad (2.7)$$

with

$$\widehat{KL_m}(P_\theta,P_{\theta_T}) := KL_m(P_\theta,P_n).$$
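The equivalence between maximizing the likelihood and minimizing $KL(P_n,P_\theta)$ on a finite support can be checked numerically. The sketch below uses a hypothetical model, the Binomial$(2,\theta)$ distribution on $\{0,1,2\}$, made-up counts, and a plain grid search; none of these choices comes from the paper.

```python
import math

def model(theta):
    """Hypothetical model for illustration: Binomial(2, theta) on {0, 1, 2}."""
    return [(1.0 - theta) ** 2, 2.0 * theta * (1.0 - theta), theta ** 2]

def normalized_log_likelihood(counts, theta):
    """sum_j (n_j / n) log P_theta(j), the criterion maximized by the MLE."""
    n = sum(counts)
    return sum(nj / n * math.log(pj) for nj, pj in zip(counts, model(theta)) if nj > 0)

def kl_empirical_to_model(counts, theta):
    """KL(P_n, P_theta) = sum_j (n_j / n) log((n_j / n) / P_theta(j)), as in (2.4)."""
    n = sum(counts)
    return sum(nj / n * math.log((nj / n) / pj)
               for nj, pj in zip(counts, model(theta)) if nj > 0)

counts = [30, 50, 20]                         # made-up occupation numbers n_j
grid = [i / 1000.0 for i in range(1, 1000)]   # candidate values of theta in (0, 1)
theta_ml = max(grid, key=lambda t: normalized_log_likelihood(counts, t))
theta_kl = min(grid, key=lambda t: kl_empirical_to_model(counts, t))
```

Both searches return the same grid point, which here is the sample mean divided by 2, the closed-form MLE of this toy model.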
In the rest of this section we will develop a similar approach for a model $\mathcal{P}_\Theta$ all of whose
members $P_\theta$ share the same infinite (countable or not) support $S$.

The statistical properties of $\theta_{ML}$ are obtained under the i.i.d. sampling having generated
the observed values.

This principle will be kept throughout this paper: the estimator is defined as maximizing
the probability that the simulated empirical measure be close to the empirical
measure as observed on the sample, conditionally on it, following the same sampling
scheme. This yields a maximum likelihood estimator, and its properties are then obtained
when randomness is introduced as resulting from the sampling scheme.

2.2 Maximum likelihood under general distributions and simple 
sampling 

When the support of the generic r.v. $X_1$ is not finite some of the arguments above are
no longer valid and some discretization scheme is required in order to get occupation
probabilities in the spirit of (2.3) or (2.6). Since all distributions $P_\theta$ in $\mathcal{P}_\Theta$ have infinite
support, i.i.d. sampling under any $P_\theta$ yields $(X_{1,\theta},\ldots,X_{n,\theta})$ such that

$$P_\theta\left(P_{n,\theta} = P_n \mid P_n\right) = 0$$

for all $n$, so that we are led to consider the optimization upon $\theta$ of probabilities of the
type $P_\theta(P_{n,\theta} \in V(P_n) \mid P_n)$ where $V(P_n)$ is a (small) neighborhood of $P_n$. Considering
the distribution of the outcomes of the simulating scheme $P_\theta$ results in the definition of
neighborhoods through partitions of $S$, hence through the $\tau_0$-topology.

When $P_n$ is the empirical measure for some observed r.v's $X_1,\ldots,X_n$, an $\varepsilon$-neighborhood
of $P_n$ contains distributions whose support is not necessarily finite, and may indeed be
equivalent to the measures in the model $\mathcal{P}_\Theta$ when defined on the Borel $\sigma$-field $\mathcal{B}(S)$.

Let $\mathcal{P}_k := (A_1,\ldots,A_k)$ be some partition in $\mathfrak{P}_k$. Denote

$$V_{k,\varepsilon}(P_n) := \left\{ Q \in \mathcal{M} \text{ such that } \max_{1\le i\le k} |P_n(A_i) - Q(A_i)| < \varepsilon \text{ and } Q(A_i) = 0 \text{ if } P_n(A_i) = 0 \right\} \qquad (2.8)$$

an open neighborhood of $P_n$.

We also define the Kullback-Leibler divergence between two probability measures
$Q$ and $P$ on the partition $\mathcal{P}_k$ through

$$KL_{\mathcal{P}_k}(Q,P) := \sum_{A_i\in\mathcal{P}_k} \log\left(\frac{Q(A_i)}{P(A_i)}\right) Q(A_i).$$

Also we define the corresponding Likelihood divergence on $\mathcal{P}_k$ through

$$(KL_m)_{\mathcal{P}_k}(Q,P) := KL_{\mathcal{P}_k}(P,Q). \qquad (2.9)$$
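As in the finite case, for any $\theta$ in $\Theta$ denote $(X_{1,\theta},\ldots,X_{n,\theta})$ a set of $n$ i.i.d. random
variables with common distribution $P_\theta$. We have

Lemma 2.1. For large $n$

$$\frac{1}{n}\log P_\theta\left(P_{n,\theta} \in V_{k,\varepsilon}(P_n) \mid P_n\right) \ge -KL_{\mathcal{P}_k}\left(V_{k,\varepsilon}(P_n),P_\theta\right) - \frac{k}{n}\log(n+1)$$

where

$$KL_{\mathcal{P}_k}\left(V_{k,\varepsilon}(P_n),P_\theta\right) := \inf_{Q\in V_{k,\varepsilon}(P_n)} KL_{\mathcal{P}_k}(Q,P_\theta).$$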

Proof. The proof uses similar arguments as in [5], Lemma 4.1. For fixed $k$ and large $n$, $P_{\theta_T}$
belongs to $V_{k,\varepsilon}(P_n)$, by the law of large numbers. Indeed for large $n$, $P_n(A_j)$ is positive
and $|P_{\theta_T}(A_j) - P_n(A_j)| < \varepsilon$ for all $j$ in $\{1,\ldots,k\}$. Assuming that for all $\theta$ in $\Theta$

$$KL(P_{\theta_T},P_\theta) < \infty$$

and taking into account the fact (see [11]) that for any probability measures $P$ and $Q$,
$KL(P,Q) = \sup_k \sup_{\mathcal{P}_k\in\mathfrak{P}_k} KL_{\mathcal{P}_k}(P,Q)$, where $\mathfrak{P}_k$ is the class of all partitions of $S$ in $k$ sets
in $\mathcal{B}(S)$, it follows that

$$KL_{\mathcal{P}_k}\left(V_{k,\varepsilon}(P_n),P_\theta\right) \text{ is finite}$$

for all fixed $k$ and large $n$. For positive $\delta$ let $P^{(n)}$ in $V_{k,\varepsilon}(P_n)$ with

$$KL_{\mathcal{P}_k}\left(P^{(n)},P_\theta\right) < KL_{\mathcal{P}_k}\left(V_{k,\varepsilon}(P_n),P_\theta\right) + \delta.$$

Let $0 < \varepsilon' < \varepsilon$ and non negative numbers $r_j$, $1 \le j \le k$, such that

$$\left|r_j - P^{(n)}(A_j)\right| < \varepsilon', \quad r_j = 0 \text{ if } P^{(n)}(A_j) = 0, \quad \text{and} \quad \sum_{j=1}^k r_j = 1.$$

The probability vector $(r_1,\ldots,r_k)$ defines a probability measure $R$ on $(S,\mathcal{P}_k)$, and $R$
belongs to $V_{k,\varepsilon}(P_n)$. By continuity of the mapping $x \to x\log\frac{x}{P_\theta(A_j)}$ it is possible to fit the
$r_j$'s such that for all $j$ between 1 and $k$

$$\left| r_j\log\frac{r_j}{P_\theta(A_j)} - P^{(n)}(A_j)\log\frac{P^{(n)}(A_j)}{P_\theta(A_j)} \right| < \frac{\delta}{k}. \qquad (2.10)$$

Indeed since all the $P_\theta$'s share the same support, if $P_\theta(A_j) = 0$ then $P_{\theta_T}(A_j) = 0$, which
in turn yields $P_n(A_j) = 0$, which through (2.8) implies $P^{(n)}(A_j) = 0$. This plus the
conventions $0/0 = 0$ and $0\log 0 = 0$ implies that (2.10) holds true for some choice of
the $r_j$'s. Choose further the $r_j$'s in such a way that $l_j := n r_j$ is an integer for all $j$. Let
$P_{n,\theta}$ denote the empirical distribution of the $X_{i,\theta}$'s. We now proceed to the evaluation of
$P_\theta(P_{n,\theta} \in V_{k,\varepsilon}(P_n) \mid P_n)$. It holds

$$P_\theta\left(P_{n,\theta} \in V_{k,\varepsilon}(P_n) \mid P_n\right) \ge P_\theta\left(P_{n,\theta}(A_j) = r_j,\; 1 \le j \le k \mid P_n\right) = \frac{n!}{\prod_{j=1}^k l_j!}\prod_{j=1}^k P_\theta(A_j)^{l_j} \ge (n+1)^{-k}\exp\left(-n\sum_{j=1}^k r_j\log\frac{r_j}{P_\theta(A_j)}\right)$$

where we used the same argument as in [5], Lemma 4.1. In turn, using (2.10),

$$\sum_{j=1}^k r_j\log\frac{r_j}{P_\theta(A_j)} \le KL_{\mathcal{P}_k}\left(V_{k,\varepsilon}(P_n),P_\theta\right) + 2\delta$$

and the proof is completed. □

The reverse inequality is as in [5], p. 790: the set $V_{k,\varepsilon}(P_n)$ is completely convex, in the
terminology of [5], whence it follows

Lemma 2.2. For all $n$

$$\frac{1}{n}\log P_\theta\left(P_{n,\theta} \in V_{k,\varepsilon}(P_n) \mid P_n\right) \le -KL_{\mathcal{P}_k}\left(V_{k,\varepsilon}(P_n),P_\theta\right).$$
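The two-sided control given by Lemmas 2.1 and 2.2 has a classical counterpart for the exact matching of cell frequencies, the bounds on the probability of a type class. The Python sketch below checks these bounds on made-up counts and cell probabilities; it concerns a single frequency vector rather than the neighborhood $V_{k,\varepsilon}(P_n)$, and all names are assumptions of the illustration.

```python
import math

def log_type_probability(counts, probs):
    """log of n! / prod(n_j!) * prod(p_j ** n_j), the multinomial probability in (2.3)."""
    n = sum(counts)
    log_coefficient = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    return log_coefficient + sum(c * math.log(p) for c, p in zip(counts, probs) if c > 0)

def kl_on_cells(q, p):
    """Kullback-Leibler divergence between two probability vectors on the cells."""
    return sum(qj * math.log(qj / pj) for qj, pj in zip(q, p) if qj > 0)

counts = [3, 5, 2]          # made-up cell counts of the simulated sample
probs = [0.2, 0.5, 0.3]     # made-up cell probabilities P_theta(A_j)
n, k = sum(counts), len(counts)

rate = log_type_probability(counts, probs) / n
kl = kl_on_cells([c / n for c in counts], probs)
lower = -kl - k * math.log(n + 1) / n   # same correction term as in Lemma 2.1
upper = -kl                             # same form as the bound of Lemma 2.2
```

The gap between the two bounds is $\frac{k}{n}\log(n+1)$, which vanishes as $n$ grows for fixed $k$.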



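Lemmas 2.1 and 2.2 link the Maximum Likelihood Principle with the Large deviation
statements. Define

$$\theta_{ML} := \arg\max_\theta \frac{1}{n}\log P_\theta\left(P_{n,\theta} \in V_{k,\varepsilon}(P_n) \mid P_n\right) \qquad (2.11)$$

and

$$\theta_{LDP} := \arg\min_\theta KL_{\mathcal{P}_k}\left(V_{k,\varepsilon}(P_n),P_\theta\right)$$

assuming those parameters defined, possibly not in a unique way. Denote

$$L_{k,\varepsilon}(\theta) := \frac{1}{n}\log P_\theta\left(P_{n,\theta} \in V_{k,\varepsilon}(P_n) \mid P_n\right)$$

and

$$K_{k,\varepsilon}(\theta) := -KL_{\mathcal{P}_k}\left(V_{k,\varepsilon}(P_n),P_\theta\right).$$

We then deduce that

$$-\frac{k}{n}\log(n+1) \le L_{k,\varepsilon}(\theta_{ML}) - K_{k,\varepsilon}(\theta_{ML}) \le 0 \le K_{k,\varepsilon}(\theta_{LDP}) - L_{k,\varepsilon}(\theta_{LDP}) \le \frac{k}{n}\log(n+1)$$

whence, since $K_{k,\varepsilon}(\theta_{ML}) \le K_{k,\varepsilon}(\theta_{LDP})$,

$$0 \le L_{k,\varepsilon}(\theta_{ML}) - L_{k,\varepsilon}(\theta_{LDP}) \le \frac{k}{n}\log(n+1) \qquad (2.12)$$

from which $\theta_{LDP}$ is a good substitute for $\theta_{ML}$ for fixed $k$ and $\varepsilon$ in the partition based
model. Note that the bounds in (2.12) do not depend on the particular choice of $\mathcal{P}_k$ in $\mathfrak{P}_k$.
Fix $k = k_n$ such that $\lim_{n\to\infty} k_n = \infty$ together with $\lim_{n\to\infty} k_n/n = 0$. Define the
partition $\mathcal{P}_k$ such that $P_n(A_j) = k_n/n$ for all $j = 1,\ldots,k$. Hence $A_j$ contains only $k_n$ sample
points. Let $\varepsilon > 0$ such that $\max_{1\le j\le k} |P_{\theta_T}(A_j) - k_n/n| < \varepsilon$. Then clearly $P_{\theta_T}$ belongs to
$V_{k,\varepsilon}(P_n)$ and $V_{k,\varepsilon}(P_n)$ is included in $V_{k,2\varepsilon}(P_{\theta_T})$. Therefore for any $\theta$ it holds

$$KL_{\mathcal{P}_k}\left(V_{k,2\varepsilon}(P_{\theta_T}),P_\theta\right) \le KL_{\mathcal{P}_k}\left(V_{k,\varepsilon}(P_n),P_\theta\right) \le KL_{\mathcal{P}_k}\left(P_{\theta_T},P_\theta\right) \qquad (2.13)$$

which proves that $\inf_\theta KL_{\mathcal{P}_k}(V_{k,\varepsilon}(P_n),P_\theta) = 0$ with attainment on $\theta'$ such that $P_{\theta'}$ and
$P_{\theta_T}$ coincide on $\mathcal{P}_k$.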

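We now turn to the study of the RHS term in (2.13). Introducing the likelihood
divergence defined in (2.9) leads to

$$KL_{\mathcal{P}_k}(P_{\theta_T},P_\theta) = (KL_m)_{\mathcal{P}_k}(P_\theta,P_{\theta_T})$$

whence minimizing $KL_{\mathcal{P}_k}(P_{\theta_T},P_\theta)$ over $\theta$ in $\Theta$ amounts to minimizing the likelihood
divergence $\theta \to (KL_m)_{\mathcal{P}_k}(P_\theta,P_{\theta_T})$. Set therefore

$$\theta_{LDP,\mathcal{P}_k} := \arg\min_\theta KL_{\mathcal{P}_k}(P_{\theta_T},P_\theta) = \arg\min_\theta (KL_m)_{\mathcal{P}_k}(P_\theta,P_{\theta_T}).$$

Based on the $\sigma$-field generated by $\mathcal{P}_k$ on $S$, the dual form (1.3) of the Likelihood divergence
pseudodistance $(KL_m)_{\mathcal{P}_k}(P_\theta,P_{\theta_T})$ yields

$$\arg\min_\theta (KL_m)_{\mathcal{P}_k}(P_\theta,P_{\theta_T}) = \arg\min_\theta \sup_\eta \sum_{A_j\in\mathcal{P}_k} \tilde\varphi'\left(\frac{P_\theta(A_j)}{P_\eta(A_j)}\right) P_\theta(A_j) - \sum_{A_j\in\mathcal{P}_k} (\tilde\varphi)^*\left(\tilde\varphi'\left(\frac{P_\theta(A_j)}{P_\eta(A_j)}\right)\right) P_{\theta_T}(A_j) \qquad (2.14)$$

with $\tilde\varphi(x) = -\log x + x - 1$ and $(\tilde\varphi)^*(x) = -\log(1-x)$. With the present choice for $\tilde\varphi$ the
terms in $P_\theta$ vanish in the above expression; however we carry out the full development,
as required in more involved sampling schemes. Now an estimate of $\theta_T$ is obtained
substituting $P_{\theta_T}$ by $P_n$ in (2.14), leading, denoting $n_j$ the number of $X_i$'s in $A_j$, to

$$\theta_{LDP,\mathcal{P}_k} := \arg\min_\theta \sup_\eta \sum_{A_j\in\mathcal{P}_k} \tilde\varphi'\left(\frac{P_\theta(A_j)}{P_\eta(A_j)}\right) P_\theta(A_j) - \sum_{A_j\in\mathcal{P}_k} \frac{n_j}{n}\,(\tilde\varphi)^*\left(\tilde\varphi'\left(\frac{P_\theta(A_j)}{P_\eta(A_j)}\right)\right).$$

Letting $n$ tend to infinity yields (recall that $k = k_n$)

$$\lim_{n\to\infty} \sup_\eta \left| \sum_{A_j\in\mathcal{P}_k} \tilde\varphi'\left(\frac{P_\theta(A_j)}{P_\eta(A_j)}\right) P_\theta(A_j) - \sum_{A_j\in\mathcal{P}_k} (\tilde\varphi)^*\left(\tilde\varphi'\left(\frac{P_\theta(A_j)}{P_\eta(A_j)}\right)\right) P_{\theta_T}(A_j) - \left[ \int \tilde\varphi'\left(\frac{p_\theta}{p_\eta}(x)\right) p_\theta(x)\,dx - \int (\tilde\varphi)^*\left(\tilde\varphi'\left(\frac{p_\theta}{p_\eta}\right)\right) dP_n \right] \right| = 0$$

w.p. 1, which in turn implies

$$\lim_{n\to\infty} \theta_{LDP,\mathcal{P}_k} - \theta_{ML} = 0$$

where $\theta_{ML}$ is readily seen to be the usual ML estimator of $\theta_T$ defined through

$$\theta_{ML} := \arg\sup_\theta \prod_{i=1}^n p_\theta(X_i).$$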

3 Weighted sampling 

This section extends the previous arguments to weighted sampling schemes. We will
show that the Maximum Likelihood paradigm as defined above can be extended to these
schemes, leading to operational procedures involving the minimization of specific divergence
pseudodistances defined in strong relation with the distribution of the weights.

The sampling scheme which we consider is commonly used in connection with the
bootstrap and is referred to as the weighted or generalized bootstrap, sometimes called wild
bootstrap, first introduced by Newton and Mason [9]. The main simplification which we
consider in the present setting lies in the fact that we assume that the weights $W_i$ are
i.i.d., whereas they are exchangeable random variables in the generalized bootstrap setting.

Let $x_1,\ldots,x_n$ be $n$ independent realizations of $n$ i.i.d. r.v's $X_1,\ldots,X_n$ with common
distribution $P_{\theta_T}$. It will be assumed that

For all $\theta$ in $\Theta$, $E_\theta X$ and $E_\theta X^2$ are finite. (3.1)

This entails that both

$$\frac{1}{n}\sum_{i=1}^n x_i \quad \text{and} \quad \frac{1}{n}\sum_{i=1}^n x_i^2$$

converge $P_{\theta_T}$-a.e. to $E_{\theta_T}X$ and $E_{\theta_T}X^2$ respectively; the same holds with $\theta_T$ substituted
by any $\theta$ in $\Theta$ when $x_1,\ldots,x_n$ is sampled under $P_\theta$. This assumption is necessary
when studying the properties of the estimates of $\theta_T$ and of $\phi(\theta_T,\theta)$ under some alternative
$\theta$.

Consider a collection $W_1,\ldots,W_n$ of independent copies of $W$, whose distribution satisfies
the conditions stated in Section 1. The weighted empirical measure $P_n^W$ is defined
through

$$P_n^W := \frac{1}{n}\sum_{i=1}^n W_i\,\delta_{x_i}.$$

This empirical measure need not be a probability measure, since its mass may not equal
1. Also it might not be positive, since the weights may take negative values. The measure
$P_n^W$ converges almost surely to $P_{\theta_T}$ when the weights $W_i$'s satisfy the hypotheses stated in
Section 1. Indeed general results pertaining to this sampling procedure state that under
regularity, functionals of the measure $P_n^W$ are asymptotically distributed as are the same
functionals of $P_n$ when the $X_i$'s are i.i.d. Therefore the weighted sampling procedure
mimics the i.i.d. sampling fluctuation in a two-step procedure: choose $n$ values of $x_i$
such that they asymptotically fit $P_{\theta_T}$, which means

$$\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^n \delta_{x_i} = P_{\theta_T}$$

deterministically, and then play the $W_i$'s on each of the $x_i$'s. Then get $P_n^W$, a proxy for
the random empirical measure $P_n$.

For any $\theta$ in $\Theta$ consider a similar sampling procedure under the weights $W_i'$'s, which
are i.i.d. copies of the $W_i$'s. Let therefore $x_{1,\theta},\ldots,x_{n,\theta}$ denote $n$ i.i.d. realizations of
$X_{1,\theta},\ldots,X_{n,\theta}$ with distribution $P_\theta$, yielding

$$P_{n,\theta}^W := \frac{1}{n}\sum_{i=1}^n W_i'\,\delta_{x_{i,\theta}}$$

the corresponding weighted empirical measure. Note that except for the choice of the generating
measure $P_\theta$, $P_{n,\theta}^W$ is obtained in the same way as $P_n^W$. The ML principle turns out to
select the value of $\theta$ making $P_{n,\theta}^W$ as close as possible to $P_n^W$, conditionally upon $P_n$.
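The construction of the weighted empirical measure can be illustrated with a few lines of Python. The sample values, the Exponential(1) weights (mean 1, variance 1) and the helper name `weighted_empirical_measure` are choices made for this sketch only.

```python
import random

def weighted_empirical_measure(xs, ws):
    """Return P_n^W as a dict {atom: mass}, the mass of an atom x being
    (1/n) times the sum of the weights W_i attached to the observations x_i = x."""
    if len(xs) != len(ws):
        raise ValueError("need exactly one weight per observation")
    n = len(xs)
    measure = {}
    for x, w in zip(xs, ws):
        measure[x] = measure.get(x, 0.0) + w / n
    return measure

rng = random.Random(0)                    # fixed seed, for reproducibility only
xs = [1, 2, 2, 3, 3, 3]                   # made-up observed values x_1, ..., x_n
ws = [rng.expovariate(1.0) for _ in xs]   # i.i.d. Exponential(1) weights W_i
p_w = weighted_empirical_measure(xs, ws)
total_mass = sum(p_w.values())            # equals the average weight, generally not 1

# With all weights equal to 1 the usual empirical measure P_n is recovered.
p_n = weighted_empirical_measure([1, 2, 2, 3], [1.0, 1.0, 1.0, 1.0])
```

The total mass of `p_w` is the average of the weights, which is why the weighted empirical measure is in general not a probability measure.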

The resulting estimates are optimal in many respects, as is the classical ML estimator
for regular models in the i.i.d. sampling scheme. The proposal which is presented here
also yields optimal estimators for some non regular models. This approach is in
line with [3], which developed a whole range of first order optimal estimation procedures
in the case of i.i.d. sampling, based on divergence minimization.

Using the notation of Section 1.1.3 we endow $\mathcal{M}(S)$ with the $\tau_0$-topology rather than the
weak topology, and define accordingly the $\sigma$-field $\mathcal{B}(\mathcal{M})$ on $\mathcal{M}(S)$. Denote by $\mathcal{M}_1(S)$ the
space of probability measures on $S$, endowed with the $\tau_0$-topology.
3.1 A Sanov conditional theorem for the weighted empirical
measure

The procedure which we are going to develop can be stated as follows.

As in the simple i.i.d. setting, select some (small) neighborhood $V_\varepsilon(P_n^W)$ of $P_n^W$
and define the MLE of $\theta_T$ as the value of $\theta$ which optimizes the probability that the simulated
empirical measure $P_{n,\theta}^W$ belongs to $V_\varepsilon(P_n^W)$. This requires a conditional Sanov type
result, substituting Lemmas 2.1 and 2.2. This result is produced in Theorem 3.1 in Section
3.1. In the same vein as in Lemmas 2.1 and 2.2, maximizing in $\theta$ this probability amounts
to minimizing a LDP rate between $P_\theta$ and $V_\varepsilon(P_{\theta_T})$. The rate is in strong relation with
the distribution of the $W_i$'s. Call it $\phi^W(V_\varepsilon(P_{\theta_T}),P_\theta) := \inf\{\phi^W(Q,P_\theta),\; Q \in V_\varepsilon(P_{\theta_T})\}$.

Since $\varepsilon$ is small, this rate is of order $\phi^W(P_{\theta_T},P_\theta)$; this is Corollary 3.1 in Section 3.1.
Turn to the original data and estimate $\phi^W(P_{\theta_T},P_\theta)$ by some plug in method to be stated
in Section 3.2. Define the ML estimator of $\theta_T$ through the minimization of the proxy of
$\phi^W(P_{\theta_T},P_\theta)$. We will prove that minimum divergence estimators play a key role in this
setting.

In order to state our conditional Sanov theorem we put forward the following lemma,
which is in the vein of Theorem 2.2 of Najim [10], which states the Sanov large deviation
theorem when the weights are i.i.d. random variables. Trashorras and Wintenberger [H]
have investigated the large deviation properties of the weighted (bootstrapped) empirical
measure with exchangeable weights under appropriate assumptions on the weights. Both
papers equip $\mathcal{M}(S)$ with the weak topology.

The lemma's proof is deferred to Section 7.



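Lemma 3.1. Assume that $P_\theta(U) > 0$ for any non-empty open set $U \subset S$, and that
$\lim_{n\to\infty} P_n = \lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^n \delta_{x_i} = P_\theta \in \mathcal{M}_1(S)$, where the convergence holds under $\tau_0$.
Then $P_{n,\theta}^W$ satisfies the LDP in $(\mathcal{M}(S),\mathcal{B}(\mathcal{M}))$ equipped with the $\tau_0$-topology with the good
convex rate function

$$\phi^W(\zeta,P_\theta) = \sup_{f\in B(S)} \left\{ \int f(x)\,\zeta(dx) - \int M(f(x))\,P_\theta(dx) \right\} = \begin{cases} \int_{\mathbb{R}^d} M^*\left(\frac{d\zeta}{dP_\theta}\right) dP_\theta, & \text{if } \zeta \text{ is a.c. w.r.t. } P_\theta \\ \infty, & \text{otherwise} \end{cases}$$

where $M^*(x) = \sup_t\; tx - M(t)$ for all real $x$ and $M(t)$ is the cumulant generating function
of $W$ defined in (1.5).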


Let $\mathcal{P}_k = (A_1,\ldots,A_k)$ denote an arbitrary partition of $S$ with $A_j$ in $\mathcal{B}(S)$ for all
$j = 1,\ldots,k$, and define the pseudometric $d_{\mathcal{P}_k}$ on $\mathcal{M}(S)$ by

$$d_{\mathcal{P}_k}(Q,R) = \max_{1\le j\le k} |Q(A_j) - R(A_j)|, \qquad Q, R \in \mathcal{M}(S).$$

For any positive $\varepsilon$, let

$$V_\varepsilon(P_n^W) = \left\{ Q \in \mathcal{M}(S) : d_{\mathcal{P}_k}(Q,P_n^W) < \varepsilon \right\}$$

denote an open neighborhood of the weighted empirical measure $P_n^W$ in the $\tau_0$-topology.
Then we have the following conditional LDP theorem.

Theorem 3.1. With the above notation and assuming that $P_{\theta_T}$ is absolutely continuous
with respect to $P_\theta$, for any positive $\varepsilon$, the following conditional LDP result holds

$$\lim_{n\to\infty} \frac{1}{n}\log P_\theta\left(P_{n,\theta}^W \in V_\varepsilon(P_n^W) \mid P_n\right) = -\phi^W\left(V_\varepsilon(P_{\theta_T}),P_\theta\right).$$
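Proof. In the following proof, $\mathcal{P}_k$ is an arbitrary partition on $S$. By the triangle inequality,

$$P_\theta\left(P_{n,\theta}^W \in V_\varepsilon(P_n^W) \mid P_n\right) = P_\theta\left(d_{\mathcal{P}_k}(P_{n,\theta}^W,P_n^W) < \varepsilon \mid P_n\right)$$
$$\ge P_\theta\left(d_{\mathcal{P}_k}(P_{n,\theta}^W,P_{\theta_T}) + d_{\mathcal{P}_k}(P_{\theta_T},P_n^W) < \varepsilon \mid P_n\right)$$
$$= P_\theta\left(d_{\mathcal{P}_k}(P_{n,\theta}^W,P_{\theta_T}) < \varepsilon - d_{\mathcal{P}_k}(P_{\theta_T},P_n^W) \mid P_n\right).$$

Since $d_{\mathcal{P}_k}(P_{\theta_T},P_n^W) \to 0$ when $n \to \infty$, for any positive $\delta$ and sufficiently large $n$ we have

$$P_\theta\left(P_{n,\theta}^W \in V_\varepsilon(P_n^W) \mid P_n\right) \ge P_\theta\left(d_{\mathcal{P}_k}(P_{n,\theta}^W,P_{\theta_T}) < \varepsilon - \delta\right) = P_\theta\left(P_{n,\theta}^W \in V_{\varepsilon-\delta}(P_{\theta_T})\right).$$

By Lemma 3.1 we obtain the conditional LDP lower bound

$$\liminf_{n\to\infty} \frac{1}{n}\log P_\theta\left(P_{n,\theta}^W \in V_\varepsilon(P_n^W) \mid P_n\right) \ge -\phi^W\left(V_{\varepsilon-\delta}(P_{\theta_T}),P_\theta\right).$$

In a similar way, we obtain the large deviation upper bound

$$P_\theta\left(P_{n,\theta}^W \in V_\varepsilon(P_n^W) \mid P_n\right) = P_\theta\left(d_{\mathcal{P}_k}(P_{n,\theta}^W,P_n^W) < \varepsilon \mid P_n\right)$$
$$\le P_\theta\left(d_{\mathcal{P}_k}(P_{n,\theta}^W,P_{\theta_T}) - d_{\mathcal{P}_k}(P_{\theta_T},P_n^W) < \varepsilon \mid P_n\right)$$
$$\le P_\theta\left(d_{\mathcal{P}_k}(P_{n,\theta}^W,P_{\theta_T}) < \varepsilon + \delta'\right) = P_\theta\left(P_{n,\theta}^W \in V_{\varepsilon+\delta'}(P_{\theta_T})\right),$$

for some positive $\delta'$. We thus obtain

$$\limsup_{n\to\infty} \frac{1}{n}\log P_\theta\left(P_{n,\theta}^W \in V_\varepsilon(P_n^W) \mid P_n\right) \le -\phi^W\left(V_{\varepsilon+\delta'}(P_{\theta_T}),P_\theta\right).$$

Let $\delta'' = \max(\delta,\delta')$; we have

$$-\phi^W\left(V_{\varepsilon-\delta''}(P_{\theta_T}),P_\theta\right) \le \liminf_{n\to\infty} \frac{1}{n}\log P_\theta\left(P_{n,\theta}^W \in V_\varepsilon(P_n^W) \mid P_n\right)$$
$$\le \limsup_{n\to\infty} \frac{1}{n}\log P_\theta\left(P_{n,\theta}^W \in V_\varepsilon(P_n^W) \mid P_n\right) \le -\phi^W\left(V_{\varepsilon+\delta''}(P_{\theta_T}),P_\theta\right).$$

Denote by $cl_{\tau_0}(V_\varepsilon(P_{\theta_T}))$ the closure of the open set $V_\varepsilon(P_{\theta_T})$ in the $\tau_0$-topology. Since $\delta''$ is
arbitrarily small, it holds

$$-\phi^W\left(V_\varepsilon(P_{\theta_T}),P_\theta\right) \le \liminf_{n\to\infty} \frac{1}{n}\log P_\theta\left(P_{n,\theta}^W \in V_\varepsilon(P_n^W) \mid P_n\right)$$
$$\le \limsup_{n\to\infty} \frac{1}{n}\log P_\theta\left(P_{n,\theta}^W \in V_\varepsilon(P_n^W) \mid P_n\right) \le -\phi^W\left(cl_{\tau_0}(V_\varepsilon(P_{\theta_T})),P_\theta\right).$$

It remains to show that

$$\phi^W\left(V_\varepsilon(P_{\theta_T}),P_\theta\right) = \phi^W\left(cl_{\tau_0}(V_\varepsilon(P_{\theta_T})),P_\theta\right). \qquad (3.2)$$

Since $P_{\theta_T}$ is absolutely continuous with respect to $P_\theta$, by Lemma 3.1 we have

$$\phi^W\left(cl_{\tau_0}(V_\varepsilon(P_{\theta_T})),P_\theta\right) \le \phi^W\left(V_\varepsilon(P_{\theta_T}),P_\theta\right) \le \phi^W\left(P_{\theta_T},P_\theta\right) < \infty. \qquad (3.3)$$

Given some small positive constant $\omega$, there exists $\mu \in cl_{\tau_0}(V_\varepsilon(P_{\theta_T}))$ satisfying

$$\phi^W(\mu,P_\theta) < \phi^W\left(cl_{\tau_0}(V_\varepsilon(P_{\theta_T})),P_\theta\right) + \omega.$$

Set $\nu \in V_\varepsilon(P_{\theta_T})$ with $\phi^W(\nu,P_\theta)$ finite (for instance $\nu = P_{\theta_T}$), and define $z(\alpha) = \alpha\mu + (1-\alpha)\nu$,
where $0 < \alpha < 1$. Obviously, we have $z(\alpha) \in V_\varepsilon(P_{\theta_T})$. By Lemma 3.1 the map $\zeta \to \phi^W(\zeta,P_\theta)$ is
convex, hence we get

$$\phi^W\left(V_\varepsilon(P_{\theta_T}),P_\theta\right) \le \lim_{\alpha\uparrow 1} \phi^W(z(\alpha),P_\theta) \le \lim_{\alpha\uparrow 1} \left(\alpha\,\phi^W(\mu,P_\theta) + (1-\alpha)\,\phi^W(\nu,P_\theta)\right)$$
$$= \phi^W(\mu,P_\theta) < \phi^W\left(cl_{\tau_0}(V_\varepsilon(P_{\theta_T})),P_\theta\right) + \omega, \qquad (3.4)$$

where the equality holds since $\phi^W(\nu,P_\theta)$ is finite by (3.3). Combine (3.3) with (3.4) to
get (3.2). This proves the conditional large deviation result. □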

Using the above theorem, we obtain the following corollary. 



Corollary 3.1. Under the assumptions of Theorem 3.1 it holds

\[
\lim_{\epsilon\to 0} \phi^W\left(V_\epsilon(P_{\theta_T}), P_\theta\right) = \phi^W\left(P_{\theta_T}, P_\theta\right).
\]

Proof. By Lemma 3.1 the rate function $\phi^W(\mu, P_\theta)$ is a good rate function, hence it is lower semi-continuous; this implies

\[
\lim_{\epsilon\to 0} \phi^W\left(V_\epsilon(P_{\theta_T}), P_\theta\right) \ge \phi^W\left(P_{\theta_T}, P_\theta\right). \tag{3.5}
\]

For any $\epsilon > 0$, we have $\phi^W(P_{\theta_T}, P_\theta) \ge \phi^W(V_\epsilon(P_{\theta_T}), P_\theta)$; this together with (3.5) completes the proof. □

3.2 Divergences associated with the weighted sampling scheme

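For any $Q$ in $V_\epsilon(P_{\theta_T})$ rewrite the good rate function using the divergence notation

\[
\phi^W(Q, P_\theta) = \int M^*\left(\frac{dQ}{dP_\theta}\right) dP_\theta \tag{3.6}
\]

from which $\phi^W(Q, P_\theta)$ is the divergence associated with the divergence function $\varphi^W := M^*$.

Commuting $P_{\theta_T}$ and $P_\theta$ in (3.6) and introducing the conjugate divergence function $\tilde\varphi^W$ yields

\[
\phi^W(Q, P_\theta) = \int \varphi^W\left(\frac{dQ}{dP_\theta}\right) dP_\theta = \int \tilde\varphi^W\left(\frac{dP_\theta}{dQ}\right) dQ = \tilde\phi^W(P_\theta, Q). \tag{3.7}
\]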

By Theorem 3.1, maximizing $P_\theta(P^W_{n,\theta} \in V_\epsilon(P^W_n) \mid P_n)$ amounts to minimizing $\phi^W(V_\epsilon(P_{\theta_T}), P_\theta)$.
A final approximation now yields the form of the criterion to be estimated in order to
define the MLE in the present setting. As $\epsilon \to 0$ the asymptotic order of $\phi^W(V_\epsilon(P_{\theta_T}), P_\theta)$
is equal to $\phi^W(P_\theta, P_{\theta_T})$ by Corollary 3.1 and (3.7), which is a proxy of $\phi^W(P_{\theta_T}, P_\theta)$ and is
therefore the theoretical criterion to be optimized in $\theta$.
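
As a numerical illustration of the conjugacy relation (3.7), the following Python sketch checks on two discrete distributions with common support that the divergence built on $\varphi^W$ between $Q$ and $P_\theta$ equals the divergence built on the conjugate function $x \mapsto x\,\varphi^W(1/x)$ with the two arguments interchanged. The choice $\varphi^W(x) = x - 1 - \log x$, which is the Legendre transform $M^*$ for standard exponential weights, is an assumption made here for concreteness; the names `divergence`, `conjugate`, `phi_w` and the two probability vectors are hypothetical.

```python
import math

def divergence(phi, q, p):
    """Return sum_x phi(q(x)/p(x)) * p(x) for discrete q, p with common full support."""
    return sum(phi(qi / pi) * pi for qi, pi in zip(q, p))

def conjugate(phi):
    """Return the conjugate divergence function x -> x * phi(1/x)."""
    return lambda x: x * phi(1.0 / x)

def phi_w(x):
    """Assumed divergence function: M*(x) = x - 1 - log x (standard exponential weights)."""
    return x - 1.0 - math.log(x)

# Hypothetical discrete distributions standing in for P_theta and Q.
p_theta = [0.2, 0.3, 0.5]
q = [0.4, 0.4, 0.2]

lhs = divergence(phi_w, q, p_theta)             # divergence of (Q, P_theta) built on phi_w
rhs = divergence(conjugate(phi_w), p_theta, q)  # conjugate divergence of (P_theta, Q)
print(lhs, rhs)
```

With this choice of $\varphi^W$ the conjugate function is $x \log x - x + 1$, so the right-hand side is the Kullback-Leibler form with the roles of the two measures exchanged.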

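We now state the dual form of the theoretical criterion $\phi^W(P_\theta, P_{\theta_T})$ using the dual
representation of divergences. It holds

\[
\phi^W(P_\theta, P_{\theta_T}) = \sup_{\alpha\in\mathcal{U}} \int h(\theta, \alpha, x)\, dP_{\theta_T}(x) \tag{3.8}
\]

where $h(\theta, \alpha, x)$ is the function appearing in the dual representation.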

We now turn to the definition of the MLE in this context, estimating the criterion and 
deriving the estimate. 

3.3 MLE under weighted sampling 

Using the dual representation of divergences, the natural estimator of $\phi^W(P_\theta, P_{\theta_T})$ is

\[
\phi^W_n(P_\theta, P_{\theta_T}) := \sup_{\alpha\in\mathcal{U}} \left\{ \int h(\theta, \alpha, x)\, dP^W_n(x) \right\}. \tag{3.9}
\]

From now on, we will use $\phi^W(\theta, \theta_T)$ to denote $\phi^W(P_\theta, P_{\theta_T})$; whence the resulting estimator
of $\phi^W(\theta_T, \theta_T)$ is

\[
\phi^W_n(\theta_T, \theta_T) := \inf_{\theta\in\Theta} \phi^W_n(\theta, \theta_T) = \inf_{\theta\in\Theta} \sup_{\alpha\in\mathcal{U}} \left\{ \int h(\theta, \alpha, x)\, dP^W_n(x) \right\}
\]

and the resulting MLE of $\theta_T$ is obtained as the minimum dual $\phi^W$ estimator

\[
\theta_{ML,W} := \arg\inf_{\theta\in\Theta} \sup_{\alpha\in\mathcal{U}} \left\{ \int h(\theta, \alpha, x)\, dP^W_n(x) \right\}. \tag{3.10}
\]

Formula (3.10) indeed defines a Maximum Likelihood estimator, in the vein of (2.1)
and (2.11). This estimator requires neither grouping nor smoothing.



4 Bahadur slope of minimum divergence tests for 
weighted data 

Consider the test of the null hypothesis $H_0$: $\theta_T = \theta$ versus the simple alternative $H_1$:
$\theta_T = \theta'$.

We consider two competing statistics for this problem. The first one is based on the
estimate of $\phi^W(P_\alpha, P_\beta)$ defined for all $(\alpha, \beta)$ in $\Theta \times \Theta$ through

where the i.i.d. sample $X_1, \dots, X_n$ has distribution $P_\beta$. The test statistic $T_n(\theta)$ converges
to 0 under $H_0$.

A competing statistic $\psi(\theta)$ writes

\[
\psi(\theta) := \psi(\theta, P^W_n)
\]

where $Q \mapsto \psi(\theta, Q)$ is assumed to satisfy $\psi(\theta, P_\theta) = 0$ and to be $\tau$-continuous with respect
to $Q$, which implies that under $H_0$ the following Large Deviation Principle holds

\[
\lim_{n\to\infty} \frac{1}{n} \log P_\theta\left(\psi(\theta) > t\right) = -I(t) = -\inf\left\{ \phi^W(P_\theta, Q) : \psi(\theta, Q) > t \right\} \tag{4.1}
\]
for any positive $t$. We also assume that under $H_1$, $\psi(\theta)$ converges to $\psi(\theta, P_{\theta'})$:

\[
\lim_{n\to\infty} \psi(\theta) = \psi(\theta, P_{\theta'}) \tag{4.2}
\]

where (4.2) holds in probability under $\theta'$.

We now state the Bahadur slope of the minimum divergence test $T_n(\theta)$.
Under $H_0$

\[
\begin{aligned}
\lim_{n\to\infty} \frac{1}{n} \log P_\theta\left(T_n(\theta) > t\right)
&= -2 \inf\left\{ \phi^W(P_\theta, Q) : \phi^W(Q, P_\theta) > t \right\} \\
&= -2 \inf\left\{ \phi^W(P_\theta, Q) : \phi^W(P_\theta, Q) > t \right\} \\
&= -2t
\end{aligned}
\]
while, under $H_1$,

\[
\lim_{n\to\infty} T_n(\theta) = \phi^W(P_\theta, P_{\theta'}) \quad \text{in probability}
\]

since $P^W_n$ converges weakly to $P_{\theta'}$.

It follows that the Bahadur slope of the minimum divergence test $T_n(\theta)$ is

\[
e_{T_n(\theta)} = -2\,\phi^W(P_\theta, P_{\theta'}).
\]

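Let us evaluate the Bahadur slope of the test $\psi(\theta)$.
Following (4.1) and (4.2) it holds

\[
e_{\psi(\theta)} = -2 \inf\left\{ \phi^W(P_\theta, Q) : \psi(\theta, Q) > \psi(\theta, P_{\theta'}) \right\}.
\]

Since $\inf\{ \phi^W(P_\theta, Q) : \psi(\theta, Q) > \psi(\theta, P_{\theta'}) \} \le \phi^W(P_\theta, P_{\theta'})$ it follows that $e_{\psi(\theta)} \le e_{T_n(\theta)}$.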
We have proved 

Proposition 4.1. Under the weighted sampling the test statistics ip (9) is Bahadur effi- 
cient among all tests which are empirical versions of r — continuous functionals . 



5 Weighted sampling in exponential families 

In this short section we show that MLE's associated with weighted sampling are specific 
with respect to the weighting; this is in contrast with the unweighted sampling (i.i.d. sim- 
ple sampling), under which all minimum divergence estimators coincide with the standard 
MLE; see pQ. 
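Let

\[
p_\theta(x) = \exp\left\{ \theta t(x) - C(\theta) \right\} d\mu(x) \tag{5.1}
\]

be an exponential family with natural parameter $\theta$ in an open set in $\mathbb{R}^d$, and where
$\mu$ denotes a common dominating measure for the model. We assume that this family is
full, i.e. that the Hessian matrix $(\partial^2/\partial\theta^2)C(\theta)$ is positive definite. Recall that under the
standard i.i.d. sampling $X_1, \dots, X_n$ the MLE $\theta_{ML}$ of $\theta$ satisfies

\[
\nabla C(\theta)\big|_{\theta=\theta_{ML}} = \frac{1}{n} \sum_{i=1}^n t(X_i).
\]

Under the weighted sampling $W_1, \dots, W_n$ corresponding to the divergence function $\varphi^W$,
conditionally on the observed data $x_1, \dots, x_n$ the MLE writes

\[
\theta_{ML,W} := \arg\inf_\theta \sup_\alpha \int \varphi'\left(\frac{dP_\theta}{dP_\alpha}\right) dP_\theta - \int \varphi^{\#}\left(\frac{dP_\theta}{dP_\alpha}\right) dP^W_n.
\]

We prove that $\theta_{ML,W}$ satisfies

\[
\nabla C(\theta)\big|_{\theta=\theta_{ML,W}} = \frac{1}{n} \sum_{i=1}^n W_i\, t(x_i).
\]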
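
The weighted likelihood equation above can be solved numerically once the family is specified. The following Python sketch does so by Newton iterations for a one-dimensional family, and uses as a hypothetical example the Poisson model in its natural parameterization, for which $C(\theta) = e^\theta$ and $t(x) = x$. The function name `solve_weighted_mle`, the data and the weights are illustrative assumptions; the weights are chosen with empirical mean 1.

```python
import math

def solve_weighted_mle(xs, ws, grad_c, hess_c, theta0=0.0, tol=1e-12, max_iter=100):
    """Solve grad_c(theta) = (1/n) * sum_i w_i * x_i by Newton's method (scalar theta)."""
    target = sum(w * x for w, x in zip(ws, xs)) / len(xs)
    theta = theta0
    for _ in range(max_iter):
        step = (grad_c(theta) - target) / hess_c(theta)
        theta -= step
        if abs(step) < tol:
            break
    return theta

# Hypothetical observed counts and weights (the weights average to 1).
xs = [2, 0, 3, 1, 4]
ws = [0.5, 1.5, 1.0, 2.0, 0.0]

# Poisson family in natural form: C(theta) = exp(theta), so gradient and Hessian are exp.
theta_weighted = solve_weighted_mle(xs, ws, math.exp, math.exp)
theta_iid = solve_weighted_mle(xs, [1.0] * len(xs), math.exp, math.exp)
print(theta_weighted, theta_iid)
```

With unit weights the solution reduces to the standard i.i.d. MLE, here the logarithm of the sample mean; with non-constant weights the solution is the logarithm of the weighted mean.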

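Denote

\[
M_n(\theta, \alpha) := \int \varphi'\left(\frac{dP_\theta}{dP_\alpha}\right) dP_\theta - \int \varphi^{\#}\left(\frac{dP_\theta}{dP_\alpha}\right) dP^W_n.
\]
Clearly, substituting (5.1), it holds for all $\theta$

\[
\inf_\theta \sup_\alpha M_n(\theta, \alpha) \ge M_n(\theta, \theta) = 0. \tag{5.2}
\]

We prove that $M_n(\theta_{ML,W}, \alpha)$ is maximal for $\alpha = \theta_{ML,W}$, which closes the proof.

Let $X_1, \dots, X_n$ be $n$ i.i.d. random variables with common distribution $P_{\theta_T}$ with $\theta_T \in \Theta$. Introduce

\[
M_n(\theta, \alpha) := \int \varphi'\left(\frac{dP_\theta}{dP_\alpha}\right) dP_\theta - \int \varphi^{\#}\left(\frac{dP_\theta}{dP_\alpha}\right) dP^W_n.
\]

We prove that

\[
\alpha = \theta_{ML,W} \text{ is the unique maximizer of } M_n(\theta_{ML,W}, \alpha) \tag{5.3}
\]

which yields

\[
\inf_\theta \sup_\alpha M_n(\theta, \alpha) \le \sup_\alpha M_n(\theta_{ML,W}, \alpha) = M_n(\theta_{ML,W}, \theta_{ML,W}) = 0 \tag{5.4}
\]

which together with (5.2) completes the proof.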
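Define

\[
\begin{aligned}
M_{n,1}(\theta, \alpha) &:= \int \varphi'\left(\exp A(\theta, \alpha, x)\right) \exp B(\theta, x)\, d\lambda(x), \\
M_{n,2}(\theta, \alpha) &:= \frac{1}{n} \sum_{i=1}^n W_i \exp\left(A(\theta, \alpha, x_i)\right) \varphi'\left(\exp A(\theta, \alpha, x_i)\right), \\
M_{n,3}(\theta, \alpha) &:= \frac{1}{n} \sum_{i=1}^n W_i\, \varphi\left(\exp A(\theta, \alpha, x_i)\right),
\end{aligned}
\]
with

\[
A(\theta, \alpha, x) := T(x)'(\theta - \alpha) + C(\alpha) - C(\theta), \qquad B(\theta, x) := T(x)'\theta - C(\theta).
\]

It holds

\[
M_n(\theta, \alpha) = M_{n,1}(\theta, \alpha) - M_{n,2}(\theta, \alpha) + M_{n,3}(\theta, \alpha)
\]
with

\[
\frac{\partial}{\partial\alpha} M_{n,1}(\theta, \alpha)\Big|_{\alpha=\theta} = \varphi^{(2)}(1) \left[ \nabla C(\theta) - \nabla C(\alpha)\big|_{\alpha=\theta} \right] = 0
\]
for all $\theta$,

\[
\frac{\partial}{\partial\alpha} M_{n,2}(\theta_{ML,W}, \alpha)\Big|_{\alpha=\theta_{ML,W}} = \frac{1}{n} \sum_{i=1}^n W_i \left[ -T(x_i) + \nabla C(\alpha)\big|_{\alpha=\theta_{ML,W}} \right]
\]
and

\[
\frac{\partial}{\partial\alpha} M_{n,3}(\theta_{ML,W}, \alpha)\Big|_{\alpha=\theta_{ML,W}} = \frac{1}{n} \sum_{i=1}^n W_i \left[ -T(x_i) + \nabla C(\alpha)\big|_{\alpha=\theta_{ML,W}} \right]
\]
where the two last displays hold iff $\alpha = \theta_{ML,W}$. Now

\[
\begin{aligned}
\frac{\partial^2}{\partial\alpha^2} M_{n,1}(\theta_{ML,W}, \alpha)\Big|_{\alpha=\theta_{ML,W}} &= \left(\varphi^{(3)}(1) + 2\varphi^{(2)}(1)\right) (\partial^2/\partial\theta^2) C(\theta_{ML,W}), \\
\frac{\partial^2}{\partial\alpha^2} M_{n,2}(\theta_{ML,W}, \alpha)\Big|_{\alpha=\theta_{ML,W}} &= \left(\varphi^{(3)}(1) + 4\varphi^{(2)}(1)\right) (\partial^2/\partial\theta^2) C(\theta_{ML,W}), \\
\frac{\partial^2}{\partial\alpha^2} M_{n,3}(\theta_{ML,W}, \alpha)\Big|_{\alpha=\theta_{ML,W}} &= \varphi^{(2)}(1)\, (\partial^2/\partial\theta^2) C(\theta_{ML,W}),
\end{aligned}
\]
whence

\[
\frac{\partial^2}{\partial\alpha^2} M_n(\theta_{ML,W}, \alpha)\Big|_{\alpha=\theta_{ML,W}} = -\varphi^{(2)}(1)\, (\partial^2/\partial\theta^2) C(\theta_{ML,W})
\]

which proves (5.3) and closes the proof.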

In contrast with the i.i.d. sampling case, minimum divergence estimators in exponential
families under the appropriate weighted sampling do not all coincide: they depend on the
divergence through the weights.

6 Weak behavior of the weighted sampling MLE's



The distribution of the estimator is obtained under the sampling scheme which determines
its form, hence here under the weighted sampling scheme. The observed sample $x_1, \dots, x_n$ is
therefore considered non random, and is assumed to satisfy

\[
\lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^n \delta_{x_i} = P_{\theta_T}
\]

and randomness is due to the set of i.i.d. weights $W_1, \dots, W_n$.

All those estimators can be written as approximate linear functionals of the weighted
empirical measure $P^W_n$. Therefore all the proofs in [3] can be adapted to the present
estimators. Even the asymptotic variances of the estimators are the same and, subsequently,
Wilks' tests, confidence areas, minimum sample sizes certifying a given asymptotic
power, etc., remain unchanged. The only arguments to be noted are the following. All
arguments pertaining to laws of large numbers for functionals of the empirical measure
carry over to the present setting, conditionally on the observations $x_1, \dots, x_n$. Indeed,
consider a statistic

\[
U_n := \frac{1}{n} \sum_{i=1}^n W_i f(x_i)
\]

where the function $f$ satisfies

\[
\lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^n f(x_i) = \mu_{1,f} < \infty
\]

and

\[
\lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^n f^2(x_i) = \mu_{2,f} < \infty.
\]

Then clearly

\[
\lim_{n\to\infty} E\,U_n = \mu_{1,f}
\]

and

\[
\lim_{n\to\infty} \mathrm{Var}\,U_n = \mu_{2,f} - (\mu_{1,f})^2.
\]
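
The conditional law of large numbers for $U_n$ can be illustrated by a small simulation in which the points $x_i$ are held fixed and only the weights are random. The Python sketch below is a minimal illustration under assumptions that are not taken from the text: standard exponential weights (mean 1), the function $f(x) = x^2$, and a deterministic grid of points on $[0,1]$ whose empirical measure approximates the uniform distribution. The helper name `weighted_mean` is hypothetical.

```python
import random

def weighted_mean(values, weights):
    """Return U_n = (1/n) * sum_i W_i * f(x_i) for precomputed values f(x_i)."""
    return sum(w * v for w, v in zip(weights, values)) / len(values)

rng = random.Random(0)   # fixed seed so that the run is reproducible
n = 20000

# Fixed, non random design points; their empirical measure is close to Uniform(0, 1).
xs = [(i + 0.5) / n for i in range(n)]
fx = [x * x for x in xs]                              # assumed f(x) = x^2

weights = [rng.expovariate(1.0) for _ in range(n)]    # assumed i.i.d. weights with mean 1

u_n = weighted_mean(fx, weights)   # random only through the weights
mu_1 = sum(fx) / n                 # (1/n) * sum_i f(x_i), close to 1/3
print(u_n, mu_1)
```

Conditionally on the points, $U_n$ fluctuates around the unweighted average of the $f(x_i)$, and the fluctuation shrinks as $n$ grows.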

Weak behavior of the estimates also follows from similar arguments. Consider for example
the statistic

\[
T_n := \sqrt{n}\, \left(U_n - \mu_{1,f}\right) \Big/ \sqrt{\mu_{2,f} - (\mu_{1,f})^2}.
\]
Using the Lindeberg central limit theorem for triangular arrays, we obtain that $T_n$ is
asymptotically standard normal conditionally upon $x_1, \dots, x_n$. It follows that the limit
distributions of $\phi^W_n(\theta, \theta_T)$ and of $\theta_{ML,W}$ conditionally on $x_1, \dots, x_n$ coincide with those of $\phi_n(\theta, \theta_T)$
and of $\theta_n$ as stated in [5] under the i.i.d. sampling. Also all results pertaining to tests of
hypotheses are similar, as is the possibility to handle non regular models.

7 Proof of Lemma 3.1



Proof. Recall that $B(S)$ denotes the class of all bounded measurable functions on $S$.
Write $B'(S)$ for the algebraic dual of $B(S)$. We equip $B'(S)$ with the $B(S)$-topology, that is, the
weakest topology which makes continuous the following linear functionals:

\[
\zeta \mapsto \langle f, \zeta \rangle : B'(S) \to \mathbb{R}, \quad \text{for all } f \text{ in } B(S),
\]

where $\langle f, \zeta \rangle$ denotes the value of $\zeta(f)$. It follows that $\mathcal{M}(S)$ is included in $B'(S)$ and
is endowed with the $\tau$-topology induced by $B(S)$. Construct the projection $p_{f_1,\dots,f_m} :
B'(S) \to \mathbb{R}^m$, $m \in \mathbb{Z}_+$, namely $p_{f_1,\dots,f_m}(\zeta) = (\langle f_1, \zeta \rangle, \dots, \langle f_m, \zeta \rangle)$, $f_1, \dots, f_m \in B(S)$.
Then for $p_{f_1,\dots,f_m}(P^W_{n,\theta}) = (\langle f_1, P^W_{n,\theta} \rangle, \dots, \langle f_m, P^W_{n,\theta} \rangle)$ we define the corresponding limit
logarithm moment generating function as follows



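\[
\begin{aligned}
h(t) &:= \lim_{n\to\infty} \frac{1}{n} \log E\left(\exp\left(n \langle t, Y_m \rangle\right)\right) = \lim_{n\to\infty} \frac{1}{n} \log E\left(\exp\left(\sum_{j=1}^m \sum_{i=1}^n t_j f_j(x_i) W_i\right)\right) \\
&= \lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^n \log E \exp\left(\sum_{j=1}^m t_j f_j(x_i) W_i\right) = \int M\left(\sum_{j=1}^m t_j f_j\right) dP_\theta
\end{aligned}
\]

where $t = (t_1, \dots, t_m) \in \mathbb{R}^m$ and $Y_m = (\langle f_1, P^W_{n,\theta} \rangle, \dots, \langle f_m, P^W_{n,\theta} \rangle)$. The function $h(t)$ is
finite since $f \in B(S)$. $M(f)$ is Gateaux-differentiable since the function $s \mapsto M(f + sg)$
is differentiable at $s = 0$ for any $f, g \in B(S)$:

\[
\frac{d}{ds} M(f + sg)\Big|_{s=0} = \frac{\int w\, g\, e^{f w}\, dP_W(w)}{\int e^{f w}\, dP_W(w)}
\]

where $P_W$ is the law of $W$. Further, the Gateaux-differentiability of $M(f)$, together
with the interchange of integration and differentiation justified by the dominated convergence
theorem, shows that $h(t)$ is also Gateaux-differentiable in $t = (t_1, \dots, t_m)$. Hence by the
Gärtner-Ellis Theorem (see e.g. Theorem 2.3.6 of [6]), $p_{f_1,\dots,f_m}(P^W_{n,\theta})$ satisfies the LDP in
$\mathbb{R}^m$ with the good rate function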



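\[
\begin{aligned}
\Phi_{f_1,\dots,f_m}\left(\langle f_1, \zeta \rangle, \dots, \langle f_m, \zeta \rangle\right)
&= \sup_{t_1,\dots,t_m \in \mathbb{R}} \left\{ \sum_{i=1}^m t_i \langle f_i, \zeta \rangle - \int M\left(\sum_{i=1}^m t_i f_i\right) dP_\theta \right\} \\
&\le \sup_{f \in B(S)} \Phi_f\left(\langle f, \zeta \rangle\right) := \phi^W(\zeta, P_\theta).
\end{aligned} \tag{7.1}
\]

Since $m$ is an arbitrary positive integer, by the Dawson-Gärtner Theorem (see e.g. Theorem
4.6.1 of [6]), $P^W_{n,\theta}$ satisfies the LDP in $B'(S)$ with the good rate function $\phi^W(\zeta, P_\theta)$, which
is:

\[
\phi^W(\zeta, P_\theta) = \sup_{f \in B(S)} \Phi_f\left(\langle f, \zeta \rangle\right) = \sup_{f \in B(S)} \left\{ \int_S f(x)\, \zeta(dx) - \int_S M(f)\, P_\theta(dx) \right\}
\]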



note that $B'(S)$ is endowed with the $\tau_0$-topology; the proof of the last equality is given below.
Here we always assume $\zeta$ is absolutely continuous with respect to $P_\theta$, otherwise
$\phi^W(\zeta, P_\theta) = \infty$. Consider $\mathcal{M}(S) \subset B'(S)$, and set $\phi^W(\zeta, P_\theta) = \infty$ when $\zeta \notin \mathcal{M}(S)$.
Hence $P^W_{n,\theta}$ satisfies the LDP in $\mathcal{M}(S)$ with the rate function $\phi^W(\zeta, P_\theta)$, for $\zeta \in \mathcal{M}(S)$.
As mentioned before, $\mathcal{M}(S)$ is endowed with the topology induced by $B'(S)$, namely the
$\tau$-topology. Now we give another representation of the rate function $\phi^W(\zeta, P_\theta)$. We
have, for $\zeta \in \mathcal{M}(S)$:

\[
\sup_{f \in B(S)} \left\{ \int_S f(x)\, \zeta(dx) - \int_S M(f)\, dP_\theta \right\}
= \sup_{f \in B(S)} \int_S \left( f\, \frac{d\zeta}{dP_\theta} - M(f) \right) dP_\theta
\le \int_S M^*\left(\frac{d\zeta}{dP_\theta}\right) dP_\theta
\]

where the inequality holds from the duality lemma, and the equality holds when $d\zeta = (dP_\theta)\, M'(f)$.
Using once again the duality lemma, we obtain the following identity:

\[
\int_S M^*\left(\frac{d\zeta}{dP_\theta}\right) dP_\theta = \sup_{f \in B(S)} \left\{ \int_S f(x)\, \zeta(dx) - \int_S M(f)\, dP_\theta \right\} = \phi^W(\zeta, P_\theta).
\]

The convexity of the rate function $\zeta \mapsto \phi^W(\zeta, P_\theta)$ follows from Theorem 7.2.3 of [6], where
the convexity of $\phi^W(\zeta, P_\theta)$ on $\mathcal{M}(S)$ endowed with the $B(S)$-topology is shown. Hence it
also holds for the $\tau$-topology, which is induced by the $B(S)$-topology. This completes the
proof of the lemma. □
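
The representation of the rate function as an integral of the Legendre transform $M^*$ can be made concrete numerically. The Python sketch below approximates $M^*(x) = \sup_t \{ t x - M(t) \}$ by a ternary search, which is valid because the objective is concave in $t$, and then evaluates a discrete analogue of $\int M^*(d\zeta/dP_\theta)\, dP_\theta$. It assumes standard exponential weights, for which $M(t) = -\log(1-t)$ for $t < 1$ and $M^*(x) = x - 1 - \log x$; this choice of weights, the function names and the two probability vectors are illustrative assumptions.

```python
import math

def log_mgf_exp(t):
    """M(t) = log E exp(tW) for W standard exponential (assumed); finite for t < 1."""
    return -math.log(1.0 - t)

def legendre(log_mgf, x, lo=-50.0, hi=1.0 - 1e-9, iters=200):
    """Approximate M*(x) = sup_t {t*x - M(t)} by ternary search on [lo, hi].

    The objective is concave in t, so the search converges to the maximum
    provided the maximizer lies in [lo, hi].
    """
    def objective(t):
        return t * x - log_mgf(t)
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) < objective(m2):
            lo = m1
        else:
            hi = m2
    return objective(0.5 * (lo + hi))

def rate_function(zeta, p):
    """Discrete analogue of the integral of M*(d zeta / dP) with respect to P."""
    return sum(legendre(log_mgf_exp, z / pi) * pi for z, pi in zip(zeta, p))

# Hypothetical discrete measures standing in for P_theta and zeta.
p_theta = [0.2, 0.3, 0.5]
zeta = [0.4, 0.4, 0.2]
rate = rate_function(zeta, p_theta)
closed_form = sum((z / p - 1.0 - math.log(z / p)) * p for z, p in zip(zeta, p_theta))
print(rate, closed_form)
```

For other weight distributions only `log_mgf_exp` and the search interval, which must stay inside the domain where the log moment generating function is finite, need to be changed.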

Remark 7.1. By the classical Gärtner-Ellis Theorem, in (7.1) the essential smoothness
of $h(t)$ is needed for $\Phi_{f_1,\dots,f_m}$ to be a "good rate function". But on a locally convex Hausdorff
topological vector space, the essential smoothness of $h(t)$ can be reduced to Gateaux
differentiability; see Corollary 4.6.14 (page 167) and the proof of Theorem 6.2.10 (page 265)
of [6].

Remark 7.2. Since $\Phi_{f_1,\dots,f_m}(\langle f_1, \zeta \rangle, \dots, \langle f_m, \zeta \rangle)$ is a good rate function in $\mathbb{R}^m$, its
level sets $\Phi^{-1}_{f_1,\dots,f_m}(a) = \{ (y_1, \dots, y_m) \in \mathbb{R}^m : \Phi_{f_1,\dots,f_m}(y_1, \dots, y_m) \le a \}$ are compact, for all
$a$ in $[0, \infty)$. Denote the projective limit of $\Phi^{-1}_{f_1,\dots,f_m}(a)$ by $\Phi^{-1}(a) = \varprojlim \Phi^{-1}_{f_1,\dots,f_m}(a)$.
According to Tychonoff's theorem, the projective limit $\Phi^{-1}(a)$ of the compact sets $\Phi^{-1}_{f_1,\dots,f_m}(a)$
is still compact, so $\phi^W(\zeta, P_\theta) = \sup_{f \in B(S)} \Phi_f(\langle f, \zeta \rangle)$ is also a good rate function in
$(\mathcal{M}(S), B(\mathcal{M}))$.